Skip to content

feat: Python UDFs: per-session inlining toggle and strict refusal setting#1546

Open
timsaucer wants to merge 6 commits into
apache:mainfrom
timsaucer:pr3-toggle-sender-strict
Open

feat: Python UDFs: per-session inlining toggle and strict refusal setting#1546
timsaucer wants to merge 6 commits into
apache:mainfrom
timsaucer:pr3-toggle-sender-strict

Conversation

@timsaucer
Copy link
Copy Markdown
Member

@timsaucer timsaucer commented May 15, 2026

Which issue does this PR close?

Addresses part of #1517

This is PR 3 of 4. The four PRs stack sequentially on top of this one; subsequent PRs target this branch's tip until it merges.

Follow up PR:

Rationale for this change

PRs 1 and 2 ship Python UDFs inline through the codec. There is a follow-on need: Untrusted-input decoding. A receiver that may read Expr.from_bytes input from an untrusted source must refuse to invoke cloudpickle.loads on the inline payload. (pickle.loads on untrusted input is still unsafe regardless of this toggle — see the security note in the docstrings.)

We resolve this by an on/off switch at the codec level. The codec already sits on every session, so the toggle is naturally per-session.

What changes are included in this PR?

  • Add a toggle at the SessionContext level to enable/disable Python inlining of UDFs, which gets passed through to the codec layer.
  • Add a sender-side context setting to match the worker side context setting
  • Add unit tests, including pickle multiprocessing example that demonstrates need for the sender side context
  • Add timeout to unit test because bad behavior could cause a hang on multiprocessing (courtesy of @ntjohnson1)

Are there any user-facing changes?

Yes, but it's pure addition:

  • SessionContext.with_python_udf_inlining is a new method.
  • datafusion.ipc.set_sender_ctx / get_sender_ctx / clear_sender_ctx are new functions for propagating a configured session through pickle.dumps.

The user-guide page documenting the full pattern (and the multiprocessing / Ray runnable examples) lands in PR 4 of this series.

timsaucer and others added 2 commits May 21, 2026 07:45
…fusal

Adds a per-session toggle that turns inline Python UDF encoding on or
off, plus the supporting plumbing to make it usable through
pickle.dumps.

Codec layer:
  * PythonLogicalCodec / PythonPhysicalCodec gain a python_udf_inlining
    bool (default true) and a with_python_udf_inlining(enabled) builder.
    Each try_encode_udf{,af,wf} short-circuits to inner when the toggle
    is off; each try_decode_udf{,af,wf} that recognizes a DFPY* magic
    on a strict codec returns a clean Execution error instead of
    invoking cloudpickle.loads. The refusal message names the UDF and
    the wire family so an operator can see at a glance whether to
    re-encode the bytes or register the UDF on the receiver.

Session layer:
  * PySessionContext::with_python_udf_inlining(enabled) returns a new
    session whose stacked logical + physical codecs both carry the
    toggle. The Arc<SessionState> is cloned (cheap), only the codec
    pair is rebuilt, so registrations and config stay attached.
  * SessionContext.with_python_udf_inlining(*, enabled) is the Python
    wrapper. enabled is keyword-only because positional booleans at
    the call site read as opaque.

Sender-side context:
  * datafusion.ipc gains set_sender_ctx / get_sender_ctx /
    clear_sender_ctx thread-locals. Expr.__reduce__ now consults
    get_sender_ctx() to pick the codec for outbound pickles, which is
    the only path through which a strict session affects pickle.dumps
    (the protocol calls __reduce__ with no arguments). Without a
    sender context the default codec is used.

Tests:
  * test_pickle_expr.py picks up TestPythonUdfInliningToggle (covers
    both directions of the toggle plus the explicit-ctx fast path),
    TestWorkerCtxLifecycle (set/clear/threading), and
    TestSenderCtxLifecycle.
  * New test_pickle_multiprocessing.py + helpers exercise the full
    driver -> worker round-trip on a multiprocessing.Pool with set_*_ctx
    installed in the worker initializer.
  * CI workflow gets a 30-minute timeout-minutes backstop so a hung
    pickle worker can't block the matrix indefinitely.

User-guide docs and the runnable examples land in PR4 of this series.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@timsaucer timsaucer force-pushed the pr3-toggle-sender-strict branch from d680e12 to 14178db Compare May 21, 2026 11:52
timsaucer and others added 4 commits May 21, 2026 08:19
Rewrite with_python_udf_inlining docstring for readability and remove
references to /user-guide/io/distributing_work, which does not exist
yet. Keep security warning inline as a .. warning:: Security block,
matching the existing pattern in Expr.to_bytes / from_bytes /
__reduce__. The central doc will land in a follow-on PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per CLAUDE.md, every Python function needs a docstring example.
Adds examples to with_python_udf_inlining, set_sender_ctx,
clear_sender_ctx, and get_sender_ctx. Also clarifies that
with_python_udf_inlining returns a new SessionContext and leaves
the original unchanged, matching the with_logical_extension_codec
pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* codec: strict refusal routes through `read_framed_payload` so
  malformed inline bytes surface their own diagnostic; the
  "inlining is disabled" message now fires only when the payload
  would have decoded.
* codec: add summary line above `PythonPhysicalCodec::with_python_udf_inlining`
  cross-link for rustdoc rendering.
* expr: hoist `get_sender_ctx` import to module top; note that
  `__reduce__` also drives `copy.copy` / `copy.deepcopy`.
* context: accept `with_python_udf_inlining` positionally or as
  kwarg (drop `*,`).
* tests: replace size-ratio heuristic with semantic check for the
  `DFPYUDF` family prefix; switch single-batch closure test to
  `pool.apply`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `SessionContext.with_python_udf_inlining` now keyword-only (`*, enabled`)
  to match the documented call style and the existing doctests/tests.
- `refuse_if_inline` and the three `try_decode_python_*` decoders short-
  circuit on a `starts_with(family)` check before `Python::attach`, so
  plans whose UDFs are not Python-defined no longer pay a GIL acquisition
  per decode call. Semantics preserved: `strip_wire_header` already
  returns `Ok(None)` when the prefix does not match.
- `datafusion.ipc` module docstring wraps the `set_sender_ctx` example in
  `try`/`finally` and notes that the thread-local holds a strong
  reference to the installed `SessionContext` until cleared.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@timsaucer timsaucer changed the title feat: per-session Python UDF inlining toggle + sender ctx + strict refusal (3/4) feat: per-session Python UDF inlining toggle and strict refusal setting May 21, 2026
@timsaucer timsaucer changed the title feat: per-session Python UDF inlining toggle and strict refusal setting feat: Python UDFs: per-session inlining toggle and strict refusal setting May 21, 2026
@timsaucer timsaucer marked this pull request as ready for review May 21, 2026 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant